Project 6: Credit Card Users Churn Prediction

Part 1: Setting up the Project

Background

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving its credit card services would lead to losses, so the bank wants to analyze customer data to identify which customers are likely to leave its credit card services and why, so that the bank can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer will churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary

  1. CLIENTNUM: Client number. Unique identifier for the customer holding the account
  2. Attrition_Flag: Internal event (customer activity) variable - if the account is closed then 1 else 0
  3. Customer_Age: Age in Years
  4. Gender: Gender of the account holder
  5. Dependent_count: Number of dependents
  6. Education_Level: Educational Qualification of the account holder
  7. Marital_Status: Marital Status of the account holder
  8. Income_Category: Annual Income Category of the account holder
  9. Card_Category: Type of Card
  10. Months_on_book: Period of relationship with the bank
  11. Total_Relationship_Count: Total no. of products held by the customer
  12. Months_Inactive_12_mon: No. of months inactive in the last 12 months
  13. Contacts_Count_12_mon: No. of Contacts in the last 12 months
  14. Credit_Limit: Credit Limit on the Credit Card
  15. Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  16. Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  17. Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  18. Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  19. Total_Trans_Ct: Total Transaction Count (Last 12 months)
  20. Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  21. Avg_Utilization_Ratio: Average Card Utilization Ratio

Importing the necessary libraries...
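The exact import block is not shown in this report, so the following is a plausible sketch of the libraries a project like this would load (the original notebook may differ, e.g. it likely also imports seaborn for the heatmaps used later).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # plotting; seaborn would typically be imported too
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, confusion_matrix

# Show all 21 columns when calling .head() / .describe()
pd.set_option("display.max_columns", None)
```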

Part 2: Initial Exploratory Data Analysis

We can see from .shape that there are 10,127 rows in the dataset. The .describe() output shows that all counts equal 10,127, which means there are no null values in the dataset. However, this does not rule out 0s or other placeholders that represent missing values. We can see that there are a number of categorical variables, such as Gender, Education Level, and Marital Status. The dependent variable - the variable of interest - Attrition_Flag is in binary format according to the data dictionary, but we see above that it is categorical. We will have to check that there are only two values so that we can convert it to binary.

CLIENTNUM appears to be a unique identifier, which means we could drop this column. From .describe(), we can see that several of the numerical variables have 0 as their minimum. We will have to decide whether these zeros are legitimate values or missing values that need to be replaced with different values (e.g., the median or mean).

Checking for missing values and/or duplicates...
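The usual pandas checks for these two issues can be sketched as follows, here on a tiny stand-in frame (the real checks run on the full 10,127-row dataset):

```python
import pandas as pd

# Tiny stand-in frame; the actual data would come from the bank's CSV.
df = pd.DataFrame({"CLIENTNUM": [1, 2, 3],
                   "Customer_Age": [45, 52, 38]})

print(df.isnull().sum())      # nulls per column
print(df.duplicated().sum())  # number of fully duplicated rows
```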

There do not appear to be any duplicate rows. At this point, we will not drop any rows, but we may need to if there are rows with important information missing that cannot easily be replaced. Let's look at the datatypes of each column...

There are 21 columns of data and 10,127 rows. Five additional columns appeared using .info() that were not displayed by .describe(). This is because they are in object format, meaning they are categorical variables. Before we make any changes to the dataset, let's create a copy called 'TB' (short for Thera Bank). This will allow us to keep the original dataset in case we need to recall it for some reason, and gives us a shorthand name for easy manipulation.

Part 3: Cleaning the Data

The first step is to remove the columns that provide no utility. One such column is CLIENTNUM. CLIENTNUM is a unique key, but considering that it is simply a numbering of the clients, we can remove this column and rely on the index to serve the same purpose. We could see from .describe() that the numbers are not significant and appear randomly assigned, which the data dictionary confirms. By dropping this column, python will not perform meaningless analysis on it (see .describe() above, for example).
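Both steps - creating the TB working copy and dropping the unique key - look roughly like this (the raw frame here is a hypothetical stand-in for the loaded CSV):

```python
import pandas as pd

# Hypothetical raw frame standing in for the loaded CSV.
data = pd.DataFrame({"CLIENTNUM": [768805383, 818770008],
                     "Customer_Age": [45, 49]})

TB = data.copy()                     # 'TB' working copy; original kept intact
TB = TB.drop(columns=["CLIENTNUM"])  # a unique key adds no predictive value
```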

Renaming Columns

Our next step is to rename some of the columns. Many of the column names are long and can be shortened for easier EDA. This is merely a preference and will not affect the model.
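The rename map below is reconstructed from the shorthand names used later in this report (Months, Products_Held, Balance, Ratio, and so on); the notebook's exact choices may differ slightly.

```python
import pandas as pd

# Rename map inferred from the shorthand used later in this report.
rename_map = {
    "Months_on_book": "Months",
    "Total_Relationship_Count": "Products_Held",
    "Months_Inactive_12_mon": "Months_Inactive",
    "Contacts_Count_12_mon": "Contacts",
    "Total_Revolving_Bal": "Balance",
    "Avg_Open_To_Buy": "Ave_Credit_Line",
    "Total_Amt_Chng_Q4_Q1": "Trans_Changes",
    "Total_Trans_Amt": "Trans_Totals",
    "Total_Trans_Ct": "Trans_Count",
    "Total_Ct_Chng_Q4_Q1": "Count_Changes",
    "Avg_Utilization_Ratio": "Ratio",
}

# Empty frame just to demonstrate the rename; TB would hold the real data.
TB = pd.DataFrame(columns=list(rename_map)).rename(columns=rename_map)
```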

We now have column names that are shorter, easier to work with, and in some cases, easier to understand. Some of the previous names, such as 'Total_Ct_Chng_Q4_Q1', were difficult to grasp.

Attrition Flag

Attrition Flag has only 2 distinct values, with 0 missing. We see that the two options are Existing Customer and Attrited Customer. The data dictionary describes this variable as "Internal event (customer activity) variable - if the account is closed then 1 else 0". We should, therefore, change Existing Customer = 0 and Attrited Customer = 1. We will do this in the data preprocessing section below.

Age

Age is a continuous variable. It ranges from 26 to 73. The mean is about 46 years. We can see that there are no missing values and no true outliers.

Gender

Gender has only 2 distinct values, with 0 missing. We see that the two options are F and M. There appear to be more female customers in the dataset than male.

Dependents

Dependents is a discrete, numerical variable. It ranges from 0 to 5. The mean is about 2 dependents. We can see that there are no missing values and no true outliers.

Education

Education has only 7 distinct values. There are 0 missing; however, over 1,500 are marked Unknown. It may be necessary to change these values later. We will have to conduct more EDA first to be sure this is the best option.

Marital Status

Marital Status has 4 distinct values. There are 0 missing; however, over 700 are marked Unknown. It may be necessary to change these values later. We will have to conduct more EDA first to be sure this is the best option.

Income

Income has 6 distinct values. There are 0 missing, but once again over 1,100 are marked Unknown. We can also see that this category is a range of values, meaning there is an order or hierarchy that should probably be maintained for python (which would not otherwise be able to identify that '120k +' is more than '40k to 60k'). We will fix this column in the data preprocessing section.

Card

Card has 4 distinct values, with 0 missing. We can probably infer that there is a hierarchy or order to the cards, with Blue being the lowest and Platinum the highest. We will order this column in the data preprocessing section below.

Months

Months is a discrete, numerical variable. It ranges from 13 to 56. The mean is about 36 months. We can see that there are no missing values and no true outliers.

Products_Held

Products_Held is a discrete, numerical variable. It ranges from 1 to 6. The mean is about 4 products. We can see that there are no missing values and no true outliers.

Months_Inactive

Months_Inactive is a discrete, numerical variable. It ranges from 0 to 6. The mean is about 2 months. We can see that there are no missing values and no true outliers.

Contacts

Contacts is a discrete, numerical variable. It ranges from 0 to 6. The mean is about 2 contacts. We can see that there are no missing values and no true outliers.

Credit_Limit

Credit_Limit is a continuous variable. It ranges from 1,438 to 34,516. The mean is about 8,632 dollars. We can see that there are no missing values. The max of almost 35k may be an outlier; we will have to check during univariate analysis.

Balance

Balance is a continuous variable. It ranges from 0 to 2517. The mean is about 1163 dollars. We can see that there are no missing values and no true outliers.

Average Open to Buy Credit Line

Average Credit Line is a continuous variable. It ranges from 1,438 to 34,516. The mean is about 8,632 dollars. We can see that there are no missing values. The max of almost 35k may be an outlier; we will have to check during univariate analysis.

Trans_Changes

Trans_Changes is a continuous variable. It ranges from 0 to 3.4. The mean is about 1. We can see that there are no missing values and no true outliers.

Trans_Totals

Trans_Totals is a continuous variable. It ranges from 510 to 18,484. The mean is about 4404 dollars. We can see that there are no missing values. The max value may be an outlier, we will have to examine it more closely using univariate analysis.

Trans_Count

Trans_Count is a continuous variable. It ranges from 10 to 139. The mean is about 65 transactions. We can see that there are no missing values. The max value may be an outlier, we will have to examine it more closely using univariate analysis.

Count_Changes

Count_Changes is a continuous variable. It ranges from 0 to about 4. The mean is about 0.71. We can see that there are no missing values and no true outliers.

Ratio

Ratio is a continuous variable. It ranges from 0 to 0.999. The mean is about 0.27. We can see that there are no missing values. The max value may be an outlier; we will have to examine it more closely using univariate analysis. We can also see that the values for this ratio all lie between 0 and 1.

Part 4: Data Preprocessing

Let's create a checkpoint here so that we can come back to this dataset later on if we need to...

The first step we will take during the Data Preprocessing stage is to convert some of the value types to values python can more readily understand. For example, the categorical values of Attrition Flag can be converted to binary. Some of the other categorical variables we identified as having a hierarchical order should be changed as well.

Attrition Flag

The Attrition Flag values were changed to binary format (0, 1) in order to match the data dictionary provided by the bank. It is also much easier to run logistic regression in python when the dependent variable is in binary format.
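A minimal sketch of this conversion, assuming the two literal labels noted earlier:

```python
import pandas as pd

TB = pd.DataFrame({"Attrition_Flag": ["Existing Customer", "Attrited Customer",
                                      "Existing Customer"]})
# 1 = account closed (churned), 0 = still open, matching the data dictionary.
TB["Attrition_Flag"] = TB["Attrition_Flag"].map(
    {"Existing Customer": 0, "Attrited Customer": 1})
```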

Gender

The Gender values are not ranked, however, they can be changed to binary so that we can look for correlations during EDA multivariate analysis. We are basically now asking the question: "Is the customer female?" 0 = no, 1 = yes.

Education

The education values were clearly ordered values. Uneducated would be considered the lowest level achieved, followed by: High School, College, Graduate, Post Graduate and Doctorate. They have been ordered accordingly.

Note: for the time being, we are leaving the 'unknown' values as -1. During the univariate/multivariate EDA stages we may find compelling reasons to replace -1s with more appropriate values.
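The ordinal coding implied by the text can be sketched as below; the label strings are taken from this report, so the notebook's exact spellings may differ.

```python
import pandas as pd

# Ordinal coding implied by the text; 'Unknown' is parked at -1 for now.
edu_order = {"Unknown": -1, "Uneducated": 0, "High School": 1, "College": 2,
             "Graduate": 3, "Post Graduate": 4, "Doctorate": 5}

TB = pd.DataFrame({"Education": ["Graduate", "Unknown", "Doctorate"]})
TB["Education"] = TB["Education"].map(edu_order)
```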

Income

The income values were clearly ordered values. Customers who make less than 40k per year would be considered the lowest level, followed by: customers who make between 40-60k, 60-80k, 80-120k and finally, 120k +. They have been ordered accordingly.

Note: for the time being, we are leaving the 'unknown' values as -1. During the univariate/multivariate EDA stages we may find compelling reasons to replace -1s with more appropriate values.

Card

The card values were clearly ordered values. Blue would be considered the lowest, or entry level card, followed by: Silver, Gold and then Platinum. They have been ordered accordingly.

We can see that Marital Status is the only object column left in the dataset. This is fine because its values are unordered categories. We would not be able to see correlations properly if we changed them to numerical values, so it is best to leave them as categorical data. We will have to OneHotEncode these values before we build our model.

Part 5: Univariate Data Analysis

Attrition Flag

Age

Gender

Dependents

Education

Marital Status

Income

Card

Months

Products_Held

Contacts

Credit_Limit

Balance

Ave_Credit_Line

Trans_Changes

Trans_Totals

Trans_Count

Count_Changes

Ratio

Part 6: Bivariate/Multivariate Data Analysis

Numerical vs Numerical

We can start by checking the correlation between the numerical data variables by using .corr and a heatmap function.
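The correlation check described here looks roughly like this (toy random data standing in for the numerical columns; the notebook would render `corr` with a heatmap, e.g. seaborn's `sns.heatmap(corr, annot=True)`):

```python
import numpy as np
import pandas as pd

# Toy numerical frame; the real call is TB.corr() on all numerical columns.
rng = np.random.default_rng(0)
num = pd.DataFrame(rng.normal(size=(100, 3)),
                   columns=["Credit_Limit", "Balance", "Ratio"])

corr = num.corr()          # pairwise Pearson correlations
print(corr.round(2))       # heatmap input; sns.heatmap(corr) would plot it
```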

Due to the numerous variables in the data set, we will first focus on the relationships between the dependent variable, Attrition Flag, and variables we deem important from initial EDA. Second, we will look at relationships of numerical vs numerical values that have moderate or strong correlations. Third, we will focus on numerical vs categorical variables of interest, and finally, we will compare categorical vs other categorical variables.

Dependent Variable vs Numerical Variables

Attrition Flag is the dependent variable and thus the most relevant to the model. We will look at relationships between this variable and other numerical variables first.

Numerical vs Numerical Variables of Interest

Trans_Totals and Trans_Count have a strong, positive correlation, as do Age and Months, and Balance and Ratio; Income and Gender have a strong, negative correlation.

Categorical vs Numerical Variables of Interest

The three variables that are of biggest concern to us, at the moment, are: Education, Income and Marital Status. These three variables have many unknown values that need to be cleaned. EDA may help us see patterns in the data so that we can make better choices when replacing the unknowns with imputed values.

Income

We can start to see some patterns in the data. First, anyone with a credit limit over about 16k has an income of 1 or more; similarly, people with credit limits of around 25k+ have incomes of 2 or more. Also, we notice that ALL women have incomes of 0 or 1, while men have incomes ranging from 0 to 4.

The above graphs do not give us more insight; the income levels seem almost evenly spread across most of these other variables. We could have assumed as much from the lack of correlation between these variables, but it is good to double-check. It seems our best bet is to use Gender and Credit Limit to create a rule for replacing the missing Income values.

Education

The above graphs give us little insight into discernible patterns. It seems that customers of varying education levels are spread fairly evenly across all other variables. It is difficult to find a pattern to follow in order to write a command to impute 'better' guesses for the unknown values. We may just need to use the median or the mode of the entire column.

Marital Status

It's difficult to be sure, but so far it appears that Unknown behaves most closely to the Divorced values. We can see similar shapes in the last two charts. Since this variable is unordered and categorical, we cannot replace it with a mean, median, or mode. The best option would be to find clusters of data that behave similarly and replace each unknown with the value of the most similar cluster. However, if all of Unknown acts as one cluster similar to Divorced, this becomes an easy switch. We need to do more analysis to be sure...

We can now change the values to binary format, essentially asking the question, "do we know if the customer is divorced?" 0 = no, we do not know; 1 = yes, we know they are divorced. This simple conversion will allow us to do more EDA and examine correlations. If Unknown and Divorced show the same (lack of) correlations with the other variables, we can assume they behave very similarly and can replace one with the other...

We can start by checking the correlation between the numerical data variables by using .corr and a heatmap function.

There is almost no correlation between Marital_Status and any other variable. Let's perform some EDA on this dataframe..

The shapes of the above graphs seem so similar that it may not be problematic to replace all of the unknown variables with 'divorced'

Categorical vs Categorical Variables

Note: Some of these variables have been compared already after they were converted to ranked, numerical data. Here, we will compare the original variables from the dataframe, before any cleaning or preprocessing occurred.

Part 7: Data Engineering

Unknown Values

We know that there are three columns with unknown values: Education, Marital Status and Income. Let's see if we can impute some of these unknown values with group modes (since they are all categorical).

Income

Income is a good place to start because we know it's highly (negatively) correlated with Gender. There is also a moderate correlation with Credit Limit. We can use these two variables to help us predict the Income of customers where the value is unknown.

We know from EDA that, unfortunately, there are 0 instances of women with incomes in category 2, 3 or 4. In other words, all known incomes of female customers are either less than 40k or between 40k and 60k. We can double-check...

We also know that there are no instances of incomes below 40k where Credit Limit is greater than 16,000. Similarly, there are no instances where income is between 40k and 60k and credit limit is greater than 24,000. We assume that these are bank-imposed credit limits based on income. We will maintain these set limits while we attempt to classify the unknown income values. Let's check whether there are instances where Gender = Female, Income = Unknown AND Credit Limit > 24,000.

We see 96 instances where women have credit limits above 24,000. This means that their incomes must be at least 60k per year, based on our assumption above. Therefore, at least one more income class for women will be created. Since the median income for women is 0 (below $40k), we will combine these two facts and assume these women have Income = 2.

Let's look at four examples of where Income = -1 (unknown). The first, row 94, is female with credit limit below 16000; the second, row 151, is male; the third row, 1925, is female with credit limit greater than 24000; finally, the 4th row, 7086, is female with credit limit between 16000 and 24000.

Ideally, we would like to replace the unknown incomes with 0 if the customer is female, or 2 if the customer is male. However, we assume women with credit limits greater than 24k must have incomes of 2 or higher. Since it is impossible for us to know whether they have incomes of 2, 3 or 4, we will replace the -1 for these women with 2, the closest such value to the group median (0). Therefore, we will write code that replaces Income = -1 & Gender = 0 (male) with the group median, 2, and replaces Income = -1 & Gender = 1 (female) with 0, unless their credit limit is above 16,000 and below 24,000 (which triggers a replacement of 1) or above 24,000 (which triggers a replacement of 2).
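The replacement rule described above can be sketched as a small function; the thresholds (16,000 and 24,000) and the Gender coding (1 = female, 0 = male) are as stated in the text, and the four example rows mirror the cases discussed.

```python
import pandas as pd

def impute_income(row):
    """Replace Income == -1 using the Gender / Credit_Limit rule above."""
    if row["Income"] != -1:
        return row["Income"]
    if row["Gender"] == 0:            # male -> group median
        return 2
    if row["Credit_Limit"] > 24000:   # female, high limit -> at least 2
        return 2
    if row["Credit_Limit"] > 16000:   # female, mid limit -> 1
        return 1
    return 0                          # female, low limit -> 0

# Stand-ins for the four example cases (female/low, male, female/high, female/mid).
TB = pd.DataFrame({"Income": [-1, -1, -1, -1],
                   "Gender": [1, 0, 1, 1],
                   "Credit_Limit": [12000, 5000, 30000, 20000]})
TB["Income"] = TB.apply(impute_income, axis=1)
```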

Re-examining rows 94, 151, 1925 and 7086, we should now see that the Income for row 151 = 2. This is the median grouped by gender (male). For rows 94, 1925 and 7086, the group median is 0 (female), but we have to ensure that the Incomes are replaced with the closest value to zero that still respects the assumed Credit Limit maximum. Therefore, row 94 should have Income = 0, row 1925's Income = 2, and row 7086 should have Income = 1. Let's check...

We can see that incomes of 0 and 1 retained their Credit Limit caps around 16000 and 24000, respectively, while incomes of 2, 3 and 4 enjoy caps around 35000. A new income level for women was created, however. There are now women who make between 60k and 80k per year, putting them in the '2' category. We found no reason why this should not be allowed. It seemed more prudent to keep the Credit Limit boundaries, which were probably pre-set by the bank.

Education

Education is a bit trickier than Income because there were no correlations between this variable and any other variable. This does not mean, however, that there are no patterns. We can start by seeing if gender plays a role in education, followed by age...

The means are all slightly more than 0.5. Because there are more female customers, we would expect the means to be slightly closer to 1.0 if there were near-equal representation of male and female in each group.

Again, we are not seeing big discrepancies. It would be logical to see older individuals having more education, since it takes longer to acquire higher levels of education, but we are not seeing it in the data.

We know from the EDA visualizations that there are far more 3s than 2s. We can also assume that the -1 values are dragging the median towards the left (closer to 0). Therefore, in this case, we will replace the unknowns with the mode of the column.
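A minimal sketch of this mode replacement (here the mode is taken over the known values; taking it over the whole column gives the same answer as long as -1 is not the most frequent value):

```python
import pandas as pd

TB = pd.DataFrame({"Education": [3, -1, 3, 2, -1]})  # -1 marks 'Unknown'

mode_val = TB.loc[TB["Education"] != -1, "Education"].mode()[0]  # most common known value
TB["Education"] = TB["Education"].replace(-1, mode_val)
```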

We can see that the -1 value for education in row 15 was changed to the column mode, or 3.

Marital Status

Marital Status is an unordered categorical variable, so we cannot perform correlation analysis on it. As such, we need to look for patterns like we did with Education. During bivariate/multivariate analysis, we discovered that the unknown values were behaving very similarly to the divorced values across most other variable types (i.e., they had very similar relationships). As a result, we plan on simply replacing the unknowns with Divorced.

We can compare the previous graphs of separated Divorced and Unknown with graphs of the new combined variable to see if they are similar.

Multicollinearity

We know that Credit Limit and Average Credit Line are perfectly correlated. This means that the two will work in tandem to influence the model in a certain direction. This phenomenon is called multicollinearity. In principle, since the two variables are telling us the same thing, it is prudent to drop one of the columns so that their effect is not amplified. Let's drop the Average Credit Line column...

Correlation

Splitting the dataframe into categorical columns so that we can evaluate the correlation of categorical data using Cramér's V.
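Cramér's V is not built into pandas, so a helper like the one below is typically defined (this is a standard formulation using scipy's chi-squared test; the notebook's exact version is not shown):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Toy example: gender alternates independently of card type.
gender = pd.Series(["F", "M"] * 50)
card = pd.Series(["Blue"] * 60 + ["Silver"] * 40)
```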

The above heatmap shows that the categorical columns are not correlated with customer churn by themselves.

The above heatmap allows us to see correlations between the numerical data and churn. We can rank them using the following code...

We have previously defined 'no-correlation' as values between +0.1 and -0.1. As such, we can see that there is no correlation between churn and the following variables:

Since there is no correlation between these variables and the dependent variable, we can remove them from the model. We have already dropped the Average Open to Buy column (renamed Ave_Credit_Line) above due to multicollinearity.

One Hot Encoding Categorical Variables

Our next step in data preprocessing is to convert the categorical columns to binary. We have already converted Education, Income and Card columns because we wanted to retain their ranked, hierarchical structure. However, with Marital Status, there was no hierarchy, so this was not possible. Now, we will write code so that Marital Status is converted to three columns: Married, Single and Divorced. We will then drop one column since, the way python works, if a customer scores 0 in both Married and Single, it MUST score a 1 in the Divorced column. Python knows this, so the third column is not necessary.

As we alluded to above, we will drop the Divorced column since it has the lowest number of customers. We must also drop the original Marital_Status column so that this information is not counted twice...
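With pandas, the whole step can be sketched with get_dummies (the notebook may use sklearn's OneHotEncoder instead; the effect is the same):

```python
import pandas as pd

TB = pd.DataFrame({"Marital_Status": ["Married", "Single", "Divorced", "Single"]})

dummies = pd.get_dummies(TB["Marital_Status"])  # Divorced / Married / Single columns
dummies = dummies.drop(columns=["Divorced"])    # redundant: implied by the other two
TB = pd.concat([TB.drop(columns=["Marital_Status"]), dummies], axis=1)
```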

Part 8 - Methodology

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting a credit card account is closed when it is actually open - False Positive - no significant loss
  2. Predicting a credit card account is open when it is actually closed - False Negative - business loss

Which case is more important?

  1. If the model predicts an account is closed when it really is not (FP), then there is no significant loss - the customer will continue to use the account and be charged fees and interest
  2. If the model predicts an account is open when it really is closed (FN), the Bank is losing money in terms of fees and interest charges and is unable to take corrective action by reaching out to the customer to lure them back.

Which metric to optimize?

Because false negatives are the costly error here, we will prioritize Recall, which measures how many of the truly churned customers the model catches.

Selecting the Best Model

In an effort to select the very best model possible, we will have to construct various individual models and apply different ensemble techniques, cross-validation methods and model tuning procedures. Ultimately, we will take 5 different approaches in our effort to select the best model for the Bank's goals.

Approach 1 - Logistic Regression

Approach 2 - Bagging and Boosting

Approach 3 - Hyperparameter Tuning using GridSearchCV and RandomizedSearchCV Pipelines

Approach 4 - Reclassifying Unknown Values using KNN Imputer

Approach 5 - Reclassifying Unknown Values using OneHotEncoder

Part 9 - Approach 1 - Logistic Regression

Model Building - Approach

  1. Partition the data into train and test set.
  2. Build a Logistic Regression model on the data sets
  3. Improve model performance by: (a) upsampling the minority class, and (b) downsampling the majority class
  4. Regularize the models
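The notebook uses SMOTE and RandomUnderSampler from imbalanced-learn for steps 3a and 3b; the core idea of both, rebalancing the two classes, can be sketched with sklearn's resample so no extra dependency is needed:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training frame: 8 non-churners vs 2 churners.
train = pd.DataFrame({"x": range(10),
                      "churn": [0] * 8 + [1] * 2})

minority = train[train["churn"] == 1]
majority = train[train["churn"] == 0]

# (a) upsample the minority class with replacement to match the majority.
upsampled = pd.concat([majority,
                       resample(minority, replace=True,
                                n_samples=len(majority), random_state=1)])

# (b) downsample the majority class without replacement to match the minority.
downsampled = pd.concat([resample(majority, replace=False,
                                  n_samples=len(minority), random_state=1),
                         minority])
```

SMOTE differs from plain upsampling in that it synthesizes new minority points by interpolating between neighbors rather than duplicating rows.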

Scaling the Dataset

State Definitions

Build Models

Model 1: Simple Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score
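The KFold + cross_val_score evaluation mentioned here looks roughly like this, on synthetic stand-in data (scoring with recall, per the metric discussion above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=kfold, scoring="recall")
print(scores.mean())  # average recall across the 5 folds
```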

Model 2: Oversampling train data using SMOTE

Logistic Regression on oversampled data

We can evaluate the model performance by using KFold and cross_val_score...

Model 3: Undersampling train data using Random Under Sampler

Logistic Regression on undersampled data

We can evaluate the model performance by using KFold and cross_val_score...

Regularization

Model 1: Simple Logistic Regression

Regularized RIDGE Model

Regularized LASSO Model

Comparing the Scores

Both Ridge and Lasso are helping to reduce the complexity of the simple regression model, but at a high cost. The recall scores have dropped substantially.

Model 2: Oversampling train data using SMOTE

Regularized RIDGE Model

Regularized LASSO Model

Comparing the Scores

Both Ridge and Lasso are helping to reduce the complexity of the simple regression model, but at a high cost. The recall scores have dropped substantially, especially on the test data.

Model 3: Undersampling train data using Random Under Sampler

Regularized RIDGE Model

Regularized LASSO Model

Comparing the Scores

Both Ridge and Lasso are helping to reduce the complexity of the simple regression model, but at a high cost. The recall scores have dropped substantially, especially on the test data.

Comparing All 6 Models

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds

Approach 1 - Conclusions

Approach 1 - Business Insights

Part 10 - Approach 2 - Bagging and Boosting

Model Building - Approach

  1. Partition the data into train and test set.
  2. Build a Decision Tree Model
  3. Build a Bagging models on the data sets
  4. Build Boosting models on the data sets

Split Data

State Definitions

Before building the model, let's create functions to calculate different metrics (Accuracy, Recall and Precision) and plot the confusion matrix...
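The helpers might look like the sketch below; the notebook's versions also plot the matrix (e.g. with a seaborn heatmap), which is omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, confusion_matrix)
from sklearn.tree import DecisionTreeClassifier

def model_metrics(model, X, y):
    """Return Accuracy, Recall and Precision for a fitted classifier."""
    pred = model.predict(X)
    return {"Accuracy": accuracy_score(y, pred),
            "Recall": recall_score(y, pred),
            "Precision": precision_score(y, pred)}

def conf_matrix(model, X, y):
    """Confusion matrix; the notebook would render this as a heatmap."""
    return confusion_matrix(y, model.predict(X))

# Quick demonstration on synthetic data.
X, y = make_classification(n_samples=100, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
m = model_metrics(tree, X, y)
```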

Decision Tree Classifier

Hyperparameter Tuning

Ensemble Techniques: Bagging

Bagging Classifier

Observations

Random Forest Classifier

Observations

Hyperparameter Tuning

Bagging Classifier

Some of the important hyperparameters available for bagging classifier are:

Let's check different metrics for bagging classifier with best hyperparameters and build a confusion matrix.

Observations

Using logistic regression as the base estimator for bagging classifier:

Observations

Random Forest Classifier

We will try to improve the model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are:

Observations

Using class_weights for random forest:

Observations

Comparing all Bagging Models

Observations

Ensemble Techniques: Boosting

AdaBoost Classifier

Hyperparameter Tuning

Gradient Boosting Classifier

Hyperparameter Tuning

XGBoost Classifier

Hyperparameter Tuning

Comparing all Boosting Models

Observations

Approach 2 - Conclusions

Approach 2 - Business Insights

Part 11 - Approach 3

Hyperparameter Tuning using GridSearchCV and RandomizedSearchCV Pipelines

Model Building - Approach

  1. Split the data into train and test set
  2. Use K-Fold cross-validation, StandardScaler and Pipelines to build different classifiers
  3. Find the best cross-validation score by tuning the model using hyperparameters
  4. Build the model and test its performance on the test set.

Split data

Hyperparameter Tuning

We will use pipelines with StandardScaler to tune the model using GridSearchCV and RandomizedSearchCV. We will also compare the performance of the top three models.

First let's create two functions to calculate different metrics and confusion matrix, so that we don't have to use the same code repeatedly for each model.
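The general shape of the pipeline-plus-grid-search tuning is sketched below; the parameter grid values are illustrative, not the notebook's actual grid, and GradientBoostingClassifier stands in for XGBoost so the sketch needs no extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data

# Scaler and classifier chained so scaling is refit inside each CV fold
# (avoiding leakage from the validation fold).
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", GradientBoostingClassifier(random_state=1))])

grid = GridSearchCV(pipe,
                    param_grid={"clf__n_estimators": [50, 100],  # illustrative values
                                "clf__max_depth": [2, 3]},
                    scoring="recall",
                    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1))
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```

RandomizedSearchCV has the same interface but samples a fixed number of parameter combinations (`n_iter`) instead of trying them all.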

XGBoost

XGBoost was our top performing model prior to hypertuning. The cross-validation score on the train data was: 85.34

GridSearchCV

RandomizedSearchCV

AdaBoost

AdaBoost was our second best performing model prior to hypertuning. The cross-validation score on the train data was: 82.97

GridSearchCV

RandomizedSearchCV

Gradient Boosting

Gradient Boosting was our third best performing model prior to hypertuning. The cross-validation score on the train data was: 82.53

GridSearchCV

RandomizedSearchCV

Comparing models from GridSearchCV and RandomizedSearchCV

Feature Importance

We can check the feature importances of the top model: XGBoost with GridSearchCV

Approach 3 - Conclusions

Approach 3 - Business Insights

Part 12 -Approach 4 - Reclassifying Unknown Values using KNN Imputer

Model Building - Approach

Here, we will attempt to build upon the best model by using alternative approaches to cleaning the data and other preprocessing techniques to see if we can further improve the model. In Approach 4, we will use the KNN imputer to determine the best replacement values for the 'unknown' values in the columns: Education, Marital Status and Income.
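The KNN imputation step can be sketched as follows: the -1 'unknown' codes are first set to NaN so the imputer treats them as missing, then each missing value is filled with the average of its nearest neighbours' values (the toy frame and n_neighbors=2 here are illustrative; the notebook's settings are not shown).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

TB = pd.DataFrame({"Income":       [0, 2, -1, 4, 1],
                   "Credit_Limit": [3000, 15000, 14000, 30000, 8000]})

# Mark the unknown code as missing so KNNImputer will fill it.
TB["Income"] = TB["Income"].replace(-1, np.nan)

imputer = KNNImputer(n_neighbors=2)
TB[["Income", "Credit_Limit"]] = imputer.fit_transform(TB[["Income", "Credit_Limit"]])
```

Row 2's Income is filled from the two rows with the closest Credit_Limit (15,000 and 8,000), i.e. the mean of their incomes.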

Split data

Marital_Status is an unordered, categorical variable; however, now that the names have been reassigned as numbers, python will treat Divorced = 3 as greater than Single = 2, and Single = 2 as greater than Married = 1. We can now convert them back and use OneHotEncoding. Any transformations to the dataset must now be done separately to prevent data leakage, since the data has already been split. Therefore, we will first transform the names in X_train, then again in X_test...

Did this approach improve the cross-validation scores? Let's compare...

The table shows that, across the board, the KNN imputer gives higher cross-validation recall scores. This could mean that the KNN imputer is making some of the unknown variables more correlated, but it could also mean that it is more exact than our manual impute approach. After all, there were assumptions made when we imputed unknown values manually. We will test the three best models (XGBoost, Gradient Boosting and AdaBoost) to check for overfitting.

Hyperparameter Tuning

We will use pipelines with StandardScaler to tune the model using GridSearchCV and RandomizedSearchCV. We will also compare the performance of the top three models.

First let's create two functions to calculate different metrics and confusion matrix, so that we don't have to use the same code repeatedly for each model.

XGBoost

XGBoost was our top-performing model prior to hyperparameter tuning. The cross-validation score on the train data after using the KNN imputer was 86.82.

GridSearchCV

RandomizedSearchCV

Gradient Boosting

Gradient Boosting was our second-best performing model on the KNN-imputed data prior to hyperparameter tuning. The cross-validation score on the train data was 84.02.

GridSearchCV

RandomizedSearchCV

AdaBoost

AdaBoost was our third-best performing model prior to hyperparameter tuning. The cross-validation score on the train data was 83.67.

GridSearchCV

RandomizedSearchCV

Comparing models from GridSearchCV and RandomizedSearchCV

Approach 4 - Conclusions

Approach 4 - Business Insights

Part 13 - Approach 5 - Reclassifying Unknown Values using OneHotEncoder

Model Building - Approach

Here, we will attempt to build upon the best model using alternative approaches to cleaning and preprocessing the data, to see if we can further improve the model.

In Approach 5, we will leave the unknown values as a separate category. When using OneHotEncoder to transform categorical columns into 'on'/'off' indicators, it is customary to drop one column, because the dropped column is implied to be 'on' if and only if all the other columns are 'off'. In this case, we will drop the 'Unknown' columns.

Note: here we will use dataset TB3, which was only partially cleaned (column names) and set aside earlier for this purpose. We can take a quick look at the shape of the dataset...

The first step is to convert the dependent variable to binary; after that, we can one-hot encode the categorical variables...
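Both steps can be sketched on a toy frame as follows: the target is mapped to binary, the categoricals are one-hot encoded, and the 'Unknown' dummy column is dropped so that 'Unknown' is implied when all other dummies are 0. The frame and column values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Attrition_Flag": ["Attrited Customer", "Existing Customer", "Existing Customer"],
    "Marital_Status": ["Married", "Unknown", "Single"],
})

# Dependent variable to binary: closed account -> 1, open -> 0
df["Attrition_Flag"] = (df["Attrition_Flag"] == "Attrited Customer").astype(int)

# One-hot encode, then drop the 'Unknown' dummy: a row is 'Unknown'
# if and only if all the remaining Marital_Status dummies are 0
df = pd.get_dummies(df, columns=["Marital_Status"])
df = df.drop(columns=["Marital_Status_Unknown"])
print(df)
```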

Split data

We can see that, for the best three models, the OneHotEncoder method actually yields higher cross-validation scores.

Hyperparameter Tuning

We will use pipelines with StandardScaler to tune the model using GridSearchCV and RandomizedSearchCV. In this instance, we will only compare the top model - XGBoost - since we know it outperforms the other models.

First, let's create two functions to calculate the different metrics and the confusion matrix, so that we don't have to repeat the same code for each model.

XGBoost

XGBoost was our top-performing model prior to hyperparameter tuning. The cross-validation score on the train data after using the KNN imputer was 86.82.

GridSearchCV

RandomizedSearchCV

Approach 5 - Conclusions

Approach 5 - Business Insights

Part 14 - Overall Conclusions

Approach 1

Approach 2

Approach 3

Approach 4

Approach 5

Overall Business Insights

Business Applications

Appendix 1 - Apply Module 4 Techniques to Improve Model Further

Note: Up until this point in the project, I have done everything to the best of my ability to fulfill the assignment criteria. Below, I improve the model even further by computing a Receiver Operating Characteristic (ROC) curve on the training data and finding the optimal classification threshold. The optimal threshold was then applied to the test data, improving the Accuracy and Recall scores and further reducing False Negatives. Although outside the scope of the assignment, this application of a previously learnt technique to the best model was used to complete the accompanying report. The following was completed for my own personal learning benefit...

The optimal model threshold is predicted to be about 0.18. Let's check the model scores while using the optimal threshold.
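The threshold-search idea can be sketched as below on synthetic, imbalanced data with a logistic-regression stand-in: the ROC curve is computed on the training data, Youden's J statistic (TPR minus FPR) picks the threshold, and that train-derived threshold is then applied to the test-set probabilities. The class weighting and model are assumptions; the notebook applies the same idea to the fitted XGBoost model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, loosely mimicking a churn-style class split
X, y = make_classification(n_samples=500, weights=[0.84], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

clf = LogisticRegression().fit(X_tr, y_tr)

# ROC on the TRAIN data; Youden's J (tpr - fpr) selects the threshold
fpr, tpr, thresholds = roc_curve(y_tr, clf.predict_proba(X_tr)[:, 1])
optimal_threshold = thresholds[np.argmax(tpr - fpr)]

# Apply the train-derived threshold to the TEST probabilities
pred_optimal = (clf.predict_proba(X_te)[:, 1] >= optimal_threshold).astype(int)
print(round(float(optimal_threshold), 3), recall_score(y_te, pred_optimal))
```

Lowering the threshold trades some precision for recall, which suits this project: a missed churner (False Negative) costs the bank more than a False Positive.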

Appendix 1 - Business Insights

Appendix 2: Analyzing Misclassified Predictions

From our best model above, we see that there are 16 False Negatives. These are misclassified predictions: the accounts of these customers actually churned, but the model predicted they would remain open. This is the biggest cost facing Thera Bank. We can examine the individual misclassified predictions using the technique below. Note: there are also 121 misclassified False Positives, but we are less concerned with these, so analysis of the False Positives will be skipped.

False Negatives - Test Data

The data seems to match. We are now faced with the difficulty that the TB_predict1 values are also unordered. We can solve this by attaching TB_predict1 to the X_test data, sorting the data, and then dropping all columns except the Predicted column. This is a bit of a workaround, because TB_predict1 is not a DataFrame, has no index, and cannot be sorted on its own.

We know that False Negatives, the predictions of interest, occur when the customer's account is actually closed but the model predicts that it is open. This information is stored in the dataset wherever Flag = 1 and Predicted = 0. We know from the confusion matrix above that this should occur 16 times. Let's create a workable dataset of just the False Negatives, called FN1...
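The filter described above can be sketched on toy data as follows; the `results` frame and its values stand in for X_test with the true labels (`Flag`) and model output (`Predicted`) attached.

```python
import pandas as pd

# Toy stand-in for X_test with labels and predictions attached
results = pd.DataFrame({
    "Customer_Age": [45, 51, 38, 60],
    "Flag":         [1, 0, 1, 1],   # actual: 1 = account closed
    "Predicted":    [0, 0, 1, 0],   # model output: 0 = predicted open
})

# False Negatives: actually closed (Flag == 1) but predicted open (Predicted == 0)
FN1 = results[(results["Flag"] == 1) & (results["Predicted"] == 0)]
print(FN1)
```

In the notebook, `len(FN1)` should come out to 16, matching the confusion matrix.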

Appendix 2 - Business Insights